Overview

At this point in the course, we have introduced you to:

  1. some very basic R programming
  2. maninpulating data.frames using dplyr, ie:
    • filter: extract rows
    • select: extract columns
    • arrange: sort the rows of a data.frame based on entries from one or more columns (optionally with the combination of desc())
    • mutate: add / change columns
  3. plotting “tidy” data using ggplot2
  4. basic concepcts of gene-level differential expression:
    • the data objects (DGEList) we use to store the assay, sample, and gene-level metadata
    • analog of gene-by-gene t-tests for questions of interest
  5. introduction to principal component analysis and how to use it to explore sample similarity among our data, in combination with:
    • ggplot2
    • plotly to make ggplot2 graphics interactive, and even 3d interactive plots

You have also been playng with the R package we have made for this course and some of the helper functions we provide. I will be using those here.

Retrieving Processed RNA-seq Data

We have quantitated and packaged up all of the RNA-seq data for you so that you can retrieve it easily using the mbl_load_rnaseq data. This function returns a DGEList of gene expression data for the organism you are asking for.

By default it will return a data object that has data from both the “may” data generated before you got here, and the “mbl” data you generated here.

Last night I showed you how to run PCA on the “may” and “mbl” from mouse separately. This code block shows you how to run it on all the data together.

First load the data:

library(mbl2018)
library(dplyr)
library(ggplot2)
library(mbl2018)

ym.all <- mbl_load_rnaseq("mouse", dataset = "all")
ym.mbl <- mbl_load_rnaseq("mouse", dataset = "mbl")
ym.may <- mbl_load_rnaseq("mouse", dataset = "may")

ym.all <- mbl_load_rnaseq("mouse", dataset = "all")
ym.all <- calcNormFactors(ym.all)

The "dataset" column in the ym.all$samples table tells you which dataset the data came from:

table(ym.all$samples$dataset)
#> 
#> may mbl 
#>  25  70

Extracing Subsets of Samples

You may want to use a subset of samples, perhaps just the “may” samples, or “knockout” samples. You can extract subsets of samples from this dgelist by testing values that are stored in the $samples columns.

For instance, you might just want the “dataset == "may" samples, or the”genotype == "knockout" samples. You can retrieve those like so:

y.may <- ym.all[, ym.all$samples$dataset == "may"]       # 25 samples
y.ko <- ym.all[, ym.all$samples$genotype == "knockout"]  # 34 samples

PCA plots

The mbl_pca function is a helper function to run PCA on a dataset and return data helpful to you.

Let’s run it on all data:

pca.all <- mbl_pca(ym.all)

The pca.all$data slot has a data.frame with the coordinates of our samples on each of the prinipcal components (“PC1”, “PC2”, etc.), and we have added the sample-level phenotypic data for these (like “source”) so you can use it for plotting.

Let’s take a loook at all the data together. We will color the data by source and shape it by the “dataset” column:

gg.all <- ggplot(pca.all$data, aes(PC1, PC2, color = source, shape = dataset)) +
  geom_point()
gg.all

Recall that we can make ggplots interactive by using the plotly package and calling ggplotly on the plot object. We can use the text aesthetic to customize the elements in the hover/tooltips for each point.

For instance, you can include the sample_id into the hover tooltip to see where he outlier samples you have identified are.

library(plotly)
gg.all <- ggplot(pca.all$data, 
                 aes(PC1, PC2, color = source, 
                     text = paste("dataset: ", dataset, "<br>",
                                  "sample_id: ", sample_id))) +
  geom_point()
ggplotly(gg.all)

Differential Gene Expression Analysis

Please refer to the inst/tutorials/dge-mouse-ko.R script for a lot more exposition on what these steps mean, but I will just write the minimal DGE workflow.

First prepare data for analysis

Note that we were using “voom” before, but now using “voomWithQualityWeights”.

# create design matrix
design <- model.matrix(~ 0 + group, ym.all$samples)

# filter out lowly expressed genes
ymf <- ym.all[filterByExpr(ym.all, design),]

# run voomWithQualityWeights to prep data for linear models. I want to color
# bars by group:
cols <- mbl_create_color_map(ymf$samples$group) # This was not covered during
cols <- cols[ymf$samples$group]                 # the tutorial
vm <- voomWithQualityWeights(ymf, design, col = cols, plot = TRUE)

pcaw <- mbl_pca(vm)
ggplot(pcaw$data, aes(PC1, PC2, color = source, size = sample.weights)) +
  geom_point()

Define the questions you want to ask

Here I setup a linear model to ask two questions:

  1. The difference in expression between “cheek_wildtype” and “cheek_knockout”
  2. The difference in expression between “old” vs “young” cheek wildtype samples

Note that you have to “phrase your questions” using the terms found in the column names of your design matrix (see colnames(vm$design))

# Phrase the questions
cm <- makeContrasts(
  cheek_KO = groupcheek_knockout - groupcheek_wildtype,
  age = groupcheek_wildtype_Old - groupcheek_wildtype_Young,
  tgko = grouptrigeminal_knockout - grouptrigeminal_wildtype,
  levels = vm$design)

# Construct linear model to ask the questions
fit <- lmFit(vm, vm$design)
fit2 <- contrasts.fit(fit, cm)

# Do something about the variance
fit2 <- eBayes(fit2)

Get results for the “Cheek Knockout” Question

res.ko <- topTable(fit2, "cheek_KO", n = Inf)
# View(res.ko) # Take a peak

Get results for the “age” Question

res.age <- topTable(fit2, "age", n = Inf)
# View(res.age) # Take a peak

More Plotting

Scatter and/or Boxplots

Did you find an interesting gene? We provided an mbl_plot_expression function to plot individual genes.

One of the top hits in the cheek_KO comparison is “Krtap20-2”. You can plot it like so:

mbl_plot_expression(vm, "Krtap20-2", "group")

There are a few tweaks you might want to make to this plot.

Maybe you just want to plot the expression of this gene in the samples whose group we tested, ie. just in “cheek_knockout” and “cheek_wildtype”. You can use the values in the vm$targets$group colum to select just those samples and plot:

mbl_plot_expression(
  vm[, vm$targets$group %in% c("cheek_knockout", "cheek_wildtype")],
  "Krtap20-2", "group")

You might be wondering what the third parameter is for (“group”) it tells mbl_plot_expression how to split the expression of this gene across the x-axis. The value you put there must be the name of one of the columns of vm$targets

What if you wanted to plot the expression of this gene, split by “source”

mbl_plot_expression(vm, "Krtap20-2", "source")

If this isn’t enough plotting for you, we provide the mbl_tidy function to take your expression expression dataset and turn it into a data.frame that you can use ggplot with.

vmtbl <- mbl_tidy(vm)

Take a peak at vmtbl to see what it is. Every gene expression observation is in one row of the data.frame, which means there are nrow x ncol rows in this dataset. Each row also has the gene information and the sample information.

The log2 normalized expression of this gene is in the “cpm” column

What we will do here is subset all the data down to those rows that are just from for this gene.

dat <- filter(vmtbl, symbol == "Krtap20-2")

Now we can plot with normal ggplot:

ggplot(dat, aes(x = group, y = cpm)) +
  geom_boxplot() +
  geom_point() +
  theme(axis.text.x=element_text(angle=90, hjust=1)) # I am rotating the x-axis labels

Or you can filter down the data to just the symbol and the You can use ggplot your own way.

dat <- filter(vmtbl, 
              symbol == "Krtap20-2",
              group %in% c("cheek_knockout", "cheek_wildtype"))
ggplot(dat, aes(x = group, y = cpm)) +
  geom_boxplot() +
  geom_point() +
  theme(axis.text.x=element_text(angle=90, hjust=1)) # I am rotating the x-axis labels

Heatmaps

You can use the coolmap function to draw heatmaps using your vm object.

The rownames() of the vm object are ensembl gene identifiers, so we need to use the ids of the genes we are interested in to draw the heatmap.

We will:

  1. take the top 20 gene identifiers found in the res.ko result table
  2. Use those to subset the vm object and pass that into coolmap
library(gplots)
#> 
#> Attaching package: 'gplots'
#> The following object is masked from 'package:stats':
#> 
#>     lowess
genez <- head(res.ko$ens_gene, 20)
coolmap(vm[genez,], mar = c(5,10))

Just like we did in the mbl_plot_expression function, we can select out just the samples we are interested in:

coolmap(vm[genez, vm$targets$group %in% c("cheek_knockout", "cheek_wildtype")],
        mar = c(5,10))

Gene Set Enrichment Analysis

We didn’t cover this, but we can do this later today – having setup all the data in this way, you are only two commands away from insight!

Some public resources: